
Parallel execution of Benchmark #124

Merged · 6 commits · Mar 10, 2023
Conversation

@juanmc2005 (Owner) commented Mar 9, 2023

This PR addresses issue #85.

Example usage

from diart.inference import Benchmark, Parallelize
from diart import OnlineSpeakerDiarization, PipelineConfig

config = PipelineConfig()
benchmark = Benchmark("/wav/dir", "/rttm/dir")
p_benchmark = Parallelize(benchmark, num_workers=4)
if __name__ == "__main__":  # Needed for multiprocessing
    p_benchmark(OnlineSpeakerDiarization, config)

Changelog

  • Add --num-workers argument to diart.benchmark
  • Add diart.inference.Parallelize, a wrapper for Benchmark to replace sequential execution with multiprocessing
  • Expose new fine-grained methods in Benchmark so that Parallelize can reuse them
  • diart.stream now uses rich progress bars
  • Add diart.progress package with ProgressBar, RichProgressBar and TQDMProgressBar as adapters for each library
  • Chronometer is now aware of the progress bar in use, so it can print reports with the correct formatting
  • BasePipeline objects now must be able to communicate their associated configuration class (through the get_config_class() static method)
  • PipelineConfig.from_namespace() is now PipelineConfig.from_dict() and receives an easily serializable configuration dict so that workers can instantiate their own pipelines (entire models cannot be sent to child processes)
    • This dictionary still needs to be documented and formalized, perhaps as a data class; otherwise its use can be confusing
  • Add parallelization example in README.md
  • Models are now lazy: they load their weights only when required, which makes them lighter for inter-process communication

Future improvements and limitations

  • Optimizer is not yet compatible with Parallelize because some progress bars break
  • Replace tqdm with rich as progress bars in both Benchmark and Optimizer (when not running in parallel)
  • Spawn segmentation and embedding models as services in separate processes so the GPU memory requirements go down from O(num_workers * model_size) to O(model_size)

@juanmc2005 juanmc2005 added the feature New feature or request label Mar 9, 2023
@juanmc2005 juanmc2005 added this to the Version 0.7 milestone Mar 9, 2023
@juanmc2005 juanmc2005 merged commit 4b744ed into develop Mar 10, 2023
@juanmc2005 juanmc2005 deleted the feat/multithread branch March 10, 2023 16:26
@juanmc2005 juanmc2005 mentioned this pull request Mar 27, 2023